We consider the variable selection problem in linear regression. Suppose that we have a set
of random variables $X_1, \dots, X_m, Y, \epsilon$ such that $Y = \sum_{k \in \pi} \alpha_k X_k + \epsilon$ with
$\pi \subseteq \{1, \dots, m\}$ and $\alpha_k \in \mathbb{R}$ unknown, and $\epsilon$ is independent of any linear
combination of $X_1, \dots, X_m$. Given $n$ examples $\{(x_{i,1}, \dots, x_{i,m}, y_i)\}_{i=1}^{n}$ actually emitted
from $(X_1, \dots, X_m, Y)$, we wish to estimate the true $\pi$ using information criteria of the
form $H + (k/2) d_n$, where $H$ is the log-likelihood with respect to $\pi$ multiplied by $-1$, and
$\{d_n\}$ is a positive real sequence. If $d_n$ is too small, we cannot obtain consistency because
of overestimation. For autoregression, Hannan and Quinn proved that, in their setting of $H$
and $k$, the rate $d_n = 2 \log\log n$ is the minimum satisfying strong consistency. This paper
settles the statement affirmatively for linear regression as well, which has a completely
different setting.

1 Introduction

Information criteria such as AIC, MDL/BIC are used for problems in model selection, 
and each problem is associated with estimating how many independent parameters exist 
from given finite examples: on how many variables another variable depends in linear 
regression (LR); on how many previous variables the subsequent variable depends on in 
auto regression (AR), etc. 

For each model g, we evaluate two factors:

1. How well the model g explains the examples; and

2. How simple the model g is,

and balance them numerically. Let $\{d_n\}_{n=1}^{\infty}$ be nonnegative reals such that $d_n/n \to 0$,
$H(g)$ the empirical entropy, which is the maximum log-likelihood multiplied by $(-1)$, and
$k(g)$ the number of parameters in model g. By information criteria, we mean the quantity

$$H(g) + \frac{k(g)}{2} d_n \qquad (1)$$

and we estimate the model g by finding one with the minimum value. For example,
$d_n = 2$ for AIC, and $d_n = \log n$ for MDL/BIC. Hence, there are as many information criteria
as sequences $\{d_n\}_{n=1}^{\infty}$, so it is impossible to list all information criteria of the form (1).
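
As a minimal sketch of how such criteria are compared, the following Python fragment evaluates $H(g) + \frac{k(g)}{2} d_n$ for three penalty sequences and picks the minimizer. The function names, the constant c = 1.01, and the toy values of $H(g)$ and $k(g)$ are illustrative assumptions, not values taken from the analysis in this paper.

```python
import math

def penalty(name, n, c=1.01):
    """Return d_n for a named criterion; n is the number of examples."""
    if name == "AIC":
        return 2.0
    if name == "BIC":
        return math.log(n)
    if name == "HQ":
        # 2c log log n with c > 1 (needs n >= 3 so that log log n > 0)
        return 2.0 * c * math.log(math.log(n))
    raise ValueError(f"unknown criterion: {name}")

def information_criterion(h, k, d_n):
    """H(g) + (k(g)/2) d_n for empirical entropy h and k parameters."""
    return h + 0.5 * k * d_n

def select_model(candidates, d_n):
    """candidates maps a model name to (H, k); return the minimizing name."""
    return min(candidates, key=lambda g: information_criterion(*candidates[g], d_n))

# Hypothetical values: model -> (empirical entropy H, parameter count k)
candidates = {"g1": (120.0, 1), "g2": (118.5, 2), "g3": (118.2, 3)}
n = 1000
choices = {name: select_model(candidates, penalty(name, n))
           for name in ("AIC", "BIC", "HQ")}
```

With these toy numbers the small AIC penalty prefers the larger model g2, while the heavier BIC and HQ penalties prefer g1, which illustrates how the choice of $d_n$ alone changes the selected model.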



In model selection, in particular for theoretical analyses, we often discuss whether
consistency holds for each $\{d_n\}$, namely, whether a sequence of selected models converges
to the correct one as $n \to \infty$ in the following senses:

1. the probability of the selected model for each n being correct converges to one (weakly 
consistent), and 

2. the set (event) of infinite sequences in which at most a finite number of errors occur 
has probability one (strongly consistent). 

Although both properties are satisfied by MDL/BIC ($d_n = \log n$), neither of the
two is satisfied by AIC ($d_n = 2$). In general, if $d_n$ is too small, strong consistency is not
obtained because of overestimation.

This paper addresses the minimum order of $\{d_n\}$ satisfying strong consistency, although
seeking such a condition is of theoretical interest in model selection (in fact, many
information criteria are regarded as satisfactory even if consistency is not achieved).

The definitions of empirical entropy and the number of parameters are different in each
problem to be considered. In 1979, Hannan and Quinn proved that for AR, $d_n = 2 \log\log n$ is
the minimum order satisfying strong consistency (the Hannan-Quinn proposition). Since then,
the same $d_n = 2 \log\log n$ has been applied to other problems as well as AR. However,
the proof of the Hannan-Quinn proposition essentially depends on the properties of the
AR problem, which is clear from the original paper by Hannan and Quinn, and the
Hannan-Quinn proposition was not proved for any other problem, including the LR problem.
Nevertheless, the information criterion HQ was applied to those problems without this
gap being noticed.

Recently, the Hannan-Quinn proposition has been proved for estimating classification 
rules which has many applications such as Markov order estimation, data mining, pattern 
recognition (Suzuki, 2006). 

This paper shows that the Hannan-Quinn proposition is true for estimating dependencies
in LR, which seems to be of great significance. Otherwise, there would be no reason to
use HQ in LR. Several authors suggested that $d_n = c \log\log n$ with some positive constant
$c$ would be enough (Rao-Wu, 1989). So, there has been evidence that the proposition is
true although no formal proof appeared. This paper proves that such a $c$ is any constant
strictly greater than two.

In Section 2, we briefly overview how the Hannan-Quinn proposition was proved in
AR. In Section 3, we derive the asymptotic error probability of model selection in LR
when information criteria are applied, which will be an important step to prove the main
result. In Section 4, we give a proof of the Hannan-Quinn proposition for LR. Section 5
summarizes the results in this paper and gives a future problem.

Throughout the paper, we denote by $X(\Omega)$ the image $\{X(\omega) \mid \omega \in \Omega\}$ of a random
variable $X : \Omega \to \mathbb{R}$, where $\Omega$ is the underlying sample space.

2 Auto Regression 

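Let $\{W_i\}_{i=-\infty}^{\infty}$ be a sequence of independent and identically distributed random variables
with expectation zero and variance one, and let $\{X_i\}_{i=-\infty}^{\infty}$ be defined by

$$X_i = \sum_{j=1}^{k} \lambda_j X_{i-j} + W_i$$

and a nonnegative real sequence $\{\lambda_j\}_{j=1}^{k}$, where we assume the expectation of each $X_i$ to be
zero. Since $\{X_i\}$ is stationary, we obtain for $m > 0$ the following equation (Yule-Walker)

$$\gamma_m = \sum_{j=1}^{k} \lambda_j \gamma_{m-j} ,$$

where $\gamma_m := E[X_i X_{i+m}]$ does not depend on $i$. Using Cramer's formula, and from the
values of $\{\gamma_m\}_{m=0}^{k}$, we obtain the values of $\sigma_k^2$ and $\{\lambda_{m,k}\}_{m=1}^{k}$ as a solution of the
$(k + 1) \times (k + 1)$ linear equations:

$$
\begin{bmatrix}
-1 & \gamma_1 & \gamma_2 & \cdots & \gamma_k \\
0 & \gamma_0 & \gamma_1 & \cdots & \gamma_{k-1} \\
0 & \gamma_1 & \gamma_0 & \cdots & \gamma_{k-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \gamma_{k-1} & \gamma_{k-2} & \cdots & \gamma_0
\end{bmatrix}
\begin{bmatrix}
\sigma_k^2 \\ \lambda_{1,k} \\ \lambda_{2,k} \\ \vdots \\ \lambda_{k,k}
\end{bmatrix}
=
\begin{bmatrix}
-\gamma_0 \\ -\gamma_1 \\ -\gamma_2 \\ \vdots \\ -\gamma_k
\end{bmatrix}
$$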

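Since the values of $\{\gamma_m\}_{m=0}^{k}$ are generally unknown, we need to estimate

$$\bar{x} := \frac{1}{n} \sum_{i=1}^{n} x_i$$

and

$$\hat\gamma_m := \hat\gamma_{-m} := \frac{1}{n} \sum_{i=1}^{n-m} (x_i - \bar{x})(x_{i+m} - \bar{x})$$

from the examples

$$x^n = (x_1, \dots, x_n) \in X_1(\Omega) \times \cdots \times X_n(\Omega) .$$

Then, we obtain the Yule-Walker equation as follows:

$$
\begin{bmatrix}
-1 & \hat\gamma_1 & \hat\gamma_2 & \cdots & \hat\gamma_k \\
0 & \hat\gamma_0 & \hat\gamma_1 & \cdots & \hat\gamma_{k-1} \\
0 & \hat\gamma_1 & \hat\gamma_0 & \cdots & \hat\gamma_{k-2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & \hat\gamma_{k-1} & \hat\gamma_{k-2} & \cdots & \hat\gamma_0
\end{bmatrix}
\begin{bmatrix}
\hat\sigma_k^2 \\ \hat\lambda_{1,k} \\ \hat\lambda_{2,k} \\ \vdots \\ \hat\lambda_{k,k}
\end{bmatrix}
=
\begin{bmatrix}
-\hat\gamma_0 \\ -\hat\gamma_1 \\ -\hat\gamma_2 \\ \vdots \\ -\hat\gamma_k
\end{bmatrix}
\qquad (2)
$$

In particular, if the order $k$ is unknown, we solve the above linear equation for each $k$ to
calculate the value of

$$L(x^n, k) = \frac{n}{2} \log \hat\sigma_k^2 + \frac{k}{2} d_n . \qquad (3)$$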



We estimate the true $k = k^*$ by the $k = \hat{k}$ that minimizes (3). This process is called
estimating the AR order. Then, we also obtain the solutions $\hat\sigma_{\hat{k}}^2$ and $\{\hat\lambda_{m,\hat{k}}\}_{m=1}^{\hat{k}}$ of
(2) with $k = \hat{k}$.

In general,

$$\hat\sigma_k^2 = (1 - \hat\lambda_{k,k}^2)\, \hat\sigma_{k-1}^2 ,$$

thus for each $k = 1, 2, \dots$, we have

$$2\{L(x^n, k) - L(x^n, k-1)\}
= n \log \frac{\hat\sigma_k^2}{\hat\sigma_{k-1}^2} + d_n
\le -n \Big(1 - \frac{\hat\sigma_k^2}{\hat\sigma_{k-1}^2}\Big) + d_n
= -n \hat\lambda_{k,k}^2 + d_n . \qquad (4)$$

As $n \to \infty$, for $k \le k^*$, $\hat\sigma_k^2 / \hat\sigma_{k-1}^2$ almost surely converges to a value less than one.
Thus, from (4), we have with probability one

$$L(x^n, 0) > L(x^n, 1) > \cdots > L(x^n, k^* - 1) > L(x^n, k^*) .$$

On the other hand, for $k \ge k^* + 1$, $\hat\sigma_k^2 / \hat\sigma_{k-1}^2$ almost surely converges to one.
Hannan-Quinn (1979) proved from the law of iterated logarithms that

$$\frac{\hat\lambda_{k,k}^2}{2 n^{-1} \log\log n} \le 1$$

with probability one, and that for $d_n = 2c \log\log n$ ($c > 1$),

$$L(x^n, k^*) < L(x^n, k^* + 1) < \cdots$$

with probability one.
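
The order estimation procedure of this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure of Hannan-Quinn (1979): the Levinson-Durbin recursion is swapped in for Cramer's formula to obtain $\hat\sigma_k^2$ through the relation $\hat\sigma_k^2 = (1 - \hat\lambda_{k,k}^2)\hat\sigma_{k-1}^2$, and the function names, the constant c = 1.01, and the simulated AR(2) coefficients are hypothetical.

```python
import math
import random

def autocovariances(x, max_lag):
    """Biased sample autocovariances gamma_hat_0, ..., gamma_hat_max_lag."""
    n = len(x)
    mean = sum(x) / n
    return [
        sum((x[i] - mean) * (x[i + m] - mean) for i in range(n - m)) / n
        for m in range(max_lag + 1)
    ]

def prediction_variances(gamma, max_order):
    """Levinson-Durbin: sigma2_k = sigma2_{k-1} * (1 - lambda_{k,k}^2)."""
    sigma2 = [gamma[0]]
    coeffs = []  # lambda_{1,k-1}, ..., lambda_{k-1,k-1} (signs as in the linear equations)
    for k in range(1, max_order + 1):
        acc = gamma[k] + sum(coeffs[j] * gamma[k - 1 - j] for j in range(k - 1))
        refl = -acc / sigma2[-1]  # lambda_{k,k}
        coeffs = [coeffs[j] + refl * coeffs[k - 2 - j] for j in range(k - 1)] + [refl]
        sigma2.append(sigma2[-1] * (1.0 - refl * refl))
    return sigma2

def select_order(x, max_order, c=1.01):
    """Minimize L(x^n, k) = (n/2) log sigma2_k + (k/2) d_n, d_n = 2c log log n."""
    n = len(x)
    d_n = 2.0 * c * math.log(math.log(n))
    sigma2 = prediction_variances(autocovariances(x, max_order), max_order)
    scores = [0.5 * n * math.log(sigma2[k]) + 0.5 * k * d_n
              for k in range(max_order + 1)]
    return min(range(max_order + 1), key=scores.__getitem__), scores

# Hypothetical AR(2) data: X_i = 0.5 X_{i-1} - 0.4 X_{i-2} + W_i, W_i ~ N(0, 1)
rng = random.Random(0)
series = [0.0, 0.0]
for _ in range(2200):
    series.append(0.5 * series[-1] - 0.4 * series[-2] + rng.gauss(0.0, 1.0))
series = series[200:]  # drop a burn-in so the sample is close to stationary
k_hat, scores = select_order(series, max_order=6)
```

For data of this kind the scores decrease sharply up to the true order, so underestimation is essentially ruled out; whether a particular sample is overestimated depends on the penalty $d_n$, which is the point of the discussion above.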
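
3 Linear Regression

Let $X_1, \dots, X_m$ be random variables such that there are no linear relations: any linear
combination of $X_1, \dots, X_m$ cannot be zero with probability one. Let $\epsilon \sim \mathcal{N}(0, \sigma^2)$ be a
normal random variable with expectation zero and variance $\sigma^2 > 0$, and

$$Y = \sum_{j=1}^{p} \alpha_j X_j + \epsilon ,$$

where $\alpha := [\alpha_1, \dots, \alpha_p]^T \in \mathbb{R}^p$ ($0 \le p \le m$). We assume that $\epsilon$ is independent of any
linear combination of $X_1, \dots, X_m$.

Suppose we do not know the values of order $p$ and coefficients $\alpha$, and that we are given
$n$ independently emitted examples

$$\{[y_i, x_{i,1}, \dots, x_{i,m}]\}_{i=1}^{n}$$

with

$$y_i \in Y(\Omega), \quad [x_{i,1}, \dots, x_{i,m}] \in X_1(\Omega) \times \cdots \times X_m(\Omega) ,$$

where $\{[x_{1,j}, \dots, x_{n,j}]^T\}_{j=1}^{m}$ are to be linearly independent. If we define

$$
X_p := \begin{bmatrix} x_{1,1} & \cdots & x_{1,p} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,p} \end{bmatrix} , \quad
y := \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} , \quad
\epsilon := \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_n \end{bmatrix} ,
$$

we can write $y = X_p \alpha + \epsilon$. Suppose that we estimate $p$ by $q$ ($0 \le q \le m$). If
we wish to minimize the quantity $\sum_{i=1}^{n} (y_i - \sum_{j=1}^{q} \alpha_{j,q} x_{i,j})^2$ given the $n$ examples, then
$\hat\alpha_q = [\hat\alpha_{1,q}, \dots, \hat\alpha_{q,q}]^T := (X_q^T X_q)^{-1} X_q^T y$ is the exact solution (minimum square error
estimation), where

$$
X_q := \begin{bmatrix} x_{1,1} & \cdots & x_{1,q} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,q} \end{bmatrix} .
$$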

3.1 Idempotent Matrices 

Suppose $p \le q$. If we define $P_q := X_q (X_q^T X_q)^{-1} X_q^T$, we have

$$P_q^2 = P_q$$

and

$$(I - P_q)^2 = I - P_q ,$$

so that the square error is expressed by

$$S_q := \sum_{i=1}^{n} \Big(y_i - \sum_{j=1}^{q} \hat\alpha_{j,q} x_{i,j}\Big)^2
= \|y - X_q \hat\alpha_q\|^2
= \|(I - P_q) y\|^2
= y^T (I - P_q) y .$$

Similarly, if $q = p$, for $P_p := X_p (X_p^T X_p)^{-1} X_p^T$ and $\hat\alpha_p = [\hat\alpha_{1,p}, \dots, \hat\alpha_{p,p}]^T := (X_p^T X_p)^{-1} X_p^T y$,
the square error is expressed by

$$S_p = y^T (I - P_p) y .$$

Thus, the difference between the square errors is

$$S_p - S_q = y^T (I - P_p) y - y^T (I - P_q) y = y^T (P_q - P_p) y . \qquad (5)$$

On the other hand, we have

$$P_q^T = X_q \{(X_q^T X_q)^T\}^{-1} X_q^T = P_q$$

and $P_p^T = P_p$. From $P_q X_p = X_p$, $P_p X_p = X_p$, we obtain

$$P_q P_p = P_q X_p (X_p^T X_p)^{-1} X_p^T = X_p (X_p^T X_p)^{-1} X_p^T = P_p$$

and

$$P_p P_q = P_p^T P_q^T = (P_q P_p)^T = P_p^T = P_p .$$

Thus, not just for $P_p$, $I - P_p$ but also for $P_q - P_p$, the property

$$(P_q - P_p)^2 = P_q^2 - P_q P_p - P_p P_q + P_p^2 = P_q - P_p$$

holds. Such square matrices satisfying the property are called idempotent matrices
(Chatterjee-Hadi, 1987).

In general, for a symmetric idempotent matrix $P \in \mathbb{R}^{n \times n}$, the inner product
$(Px, (I - P)x) = 0$ for any $x = Px + (I - P)x \in \mathbb{R}^n$, so that the eigenspaces are

1. $V_1 := \{Px \mid x \in \mathbb{R}^n\}$ with $\dim(V_1) = \mathrm{rank}(P)$, and

2. $V_0 := \{(I - P)x \mid x \in \mathbb{R}^n\}$ with $\dim(V_0) = n - \mathrm{rank}(P)$.

Since the eigenvalues are one and zero, the multiplicity of eigenvalue one is the same as
the trace. Notice that for $X_q^T X_q = [y_{j,k}]$ and $(X_q^T X_q)^{-1} = [z_{j,k}]$,

$$\mathrm{trace}(P_q) = \mathrm{trace}(X_q (X_q^T X_q)^{-1} X_q^T)
= \sum_{i=1}^{n} \sum_{j=1}^{q} \sum_{k=1}^{q} x_{i,j} z_{j,k} x_{i,k}
= \sum_{j=1}^{q} \sum_{k=1}^{q} y_{k,j} z_{j,k}
= \sum_{k=1}^{q} 1 = q ,$$

and $\mathrm{trace}(P_p) = p$, so that we have the following table.

P            trace(P)    dim(V_1)    dim(V_0)     rank(P)
P_p          p           p           n - p        p
I - P_p      n - p       n - p       p            n - p
P_q - P_p    q - p       q - p       n - q + p    q - p
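
The identities of this subsection can be checked numerically. The sketch below builds $P_q$ and $P_p$ for a small, hypothetical design matrix (n = 6, q = 3, p = 2) in plain Python and exposes the quantities appearing in the table; the helper names and the design matrix are assumptions made for illustration only.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inverse(a):
    """Gauss-Jordan inverse with partial pivoting (small matrices only)."""
    n = len(a)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def projection(x):
    """P = X (X^T X)^{-1} X^T for a full-column-rank matrix X."""
    xt = transpose(x)
    return matmul(matmul(x, inverse(matmul(xt, x))), xt)

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def max_abs_diff(a, b):
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))

# Hypothetical design: columns 1, i, i^2 for i = 0..5 (n = 6, q = 3, p = 2)
x_q = [[1.0, float(i), float(i * i)] for i in range(6)]
x_p = [row[:2] for row in x_q]
p_q, p_p = projection(x_q), projection(x_p)
diff = [[u - v for u, v in zip(rq, rp)] for rq, rp in zip(p_q, p_p)]

idempotent_gap = max_abs_diff(matmul(diff, diff), diff)  # (P_q - P_p)^2 vs P_q - P_p
product_gap = max_abs_diff(matmul(p_q, p_p), p_p)        # P_q P_p vs P_p
traces = (trace(p_p), trace(p_q), trace(diff))           # close to (p, q, q - p)
```

Up to rounding error, both gaps are zero and the traces equal p, q and q - p, in agreement with the table.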

3.2 Error probability in model selection

Proposition 1 If $p < q$, $\frac{S_p - S_q}{S_p / n}$ asymptotically obeys the $\chi^2$ distribution with freedom
$q - p$.

Proof: Given $X_p$, we choose an orthogonal matrix $U = [u_1, \dots, u_n]$ of $I - P_p$ so that
$U_1 = \langle u_1, \dots, u_{n-p} \rangle$ and $U_0 = \langle u_{n-p+1}, \dots, u_n \rangle$ are the eigenspaces of eigenvalues
one and zero, respectively. Notice that

$$(I - P_p) y = y - (X_p \alpha + P_p \epsilon) = \epsilon - P_p \epsilon = (I - P_p) \epsilon . \qquad (6)$$

For $j = 1, \dots, n - p$, multiplying both sides by $u_j^T$ from the left, we get a normal random
variable

$$z_j := u_j^T y = u_j^T \epsilon .$$

Since the expectation and variance of $\epsilon_i$ are zero and $\sigma^2$ (independent), and

$$u_j^T u_k = \begin{cases} 1, & j = k \\ 0, & j \ne k \end{cases}$$

we have $E[z_j] = 0$ and

$$E[z_j z_k] = E[u_j^T \epsilon \cdot u_k^T \epsilon] = \sigma^2 u_j^T u_k = \begin{cases} \sigma^2, & j = k \\ 0, & j \ne k \end{cases}$$

Thus, from the strong law of large numbers, with probability one as $n \to \infty$,

$$\frac{1}{n} S_p = \frac{1}{n} \sum_{j=1}^{n-p} z_j^2 \to \sigma^2 . \qquad (7)$$

On the other hand, given $X_q$, we choose an orthogonal matrix $V = [v_1, \dots, v_n]$ of
$P_q - P_p$ so that $V_1 = \langle v_1, \dots, v_{q-p} \rangle$ and $V_0 = \langle v_{q-p+1}, \dots, v_n \rangle$ are the eigenspaces
of eigenvalues one and zero, respectively. Notice that from (6), we have

$$(P_q - P_p) y = P_q (I - P_p) y = P_q (I - P_p) \epsilon = (P_q - P_p) \epsilon .$$

For $j = 1, \dots, q - p$, multiplying both sides by $v_j^T$ from the left, we get a normal random
variable

$$r_j := v_j^T y = v_j^T \epsilon .$$

Since the expectation and variance of $\epsilon_i$ are zero and $\sigma^2$ (independent), and

$$v_j^T v_k = \begin{cases} 1, & j = k \\ 0, & j \ne k \end{cases}$$

we have $E[r_j] = 0$ and

$$E[r_j r_k] = E[v_j^T \epsilon \cdot v_k^T \epsilon] = \sigma^2 v_j^T v_k = \begin{cases} \sigma^2, & j = k \\ 0, & j \ne k \end{cases}$$

Hence, as $n \to \infty$,

$$\frac{S_p - S_q}{\sigma^2} = \sum_{j=1}^{q-p} \frac{r_j^2}{\sigma^2} \sim \chi^2_{q-p} , \qquad (8)$$

where the fact that the square sum of $q - p$ independent random variables with the
standard normal distribution obeys the $\chi^2$ distribution of freedom $q - p$ has been applied.
Equations (7) and (8) imply Proposition 1.

(Q. E. D.)
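
Proposition 1 can be illustrated by simulation. The following sketch draws the statistic $(S_p - S_q)/(S_p/n)$ repeatedly and compares its average with $q - p$, the mean of the $\chi^2$ distribution of freedom $q - p$. The Gaussian covariates, the unit coefficients, the sample sizes, and the Gram-Schmidt residual computation are all assumptions chosen for the illustration.

```python
import math
import random

def rss(columns, y):
    """Residual sum of squares of y on the given columns (modified Gram-Schmidt)."""
    basis = []
    for col in columns:
        v = list(col)
        for b in basis:
            coef = sum(vi * bi for vi, bi in zip(v, b))
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])
    r = list(y)
    for b in basis:
        coef = sum(ri * bi for ri, bi in zip(r, b))
        r = [ri - coef * bi for ri, bi in zip(r, b)]
    return sum(ri * ri for ri in r)

def statistic(rng, n, p, q, sigma=1.0):
    """One draw of (S_p - S_q) / (S_p / n); only the first p covariates matter."""
    cols = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(q)]
    y = [sum(cols[j][i] for j in range(p)) + rng.gauss(0.0, sigma)
         for i in range(n)]
    s_p = rss(cols[:p], y)
    s_q = rss(cols[:q], y)
    return (s_p - s_q) / (s_p / n)

rng = random.Random(1)
n, p, q, trials = 200, 1, 3, 400
draws = [statistic(rng, n, p, q) for _ in range(trials)]
mean = sum(draws) / trials  # should be close to q - p = 2
```

Every draw is nonnegative because adding covariates cannot increase the square error, and the sample mean should lie near $q - p$ when $n$ is large compared with $q$.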

In the sequel, for $\pi \subseteq \{1, \dots, m\}$, we write the square error of $\{X_j\}_{j \in \pi}$ and $Y$ by $S(\pi)$,
and put

$$L(z^n, \pi) := n \log S(\pi) + \frac{k(\pi)}{2} d_n$$

and $k(\pi) = |\pi|$ given $z^n = \{[y_i, x_{i,1}, \dots, x_{i,m}]\}_{i=1}^{n}$. Let $\pi_* \subseteq \{1, \dots, m\}$ be the true $\pi$.

Theorem 1 For $\pi \supset \pi_*$, the probability of $L(z^n, \pi) < L(z^n, \pi_*)$ is

$$\int_{n\{1 - \exp[-\frac{k(\pi) - k(\pi_*)}{2n} d_n]\}}^{\infty} f_{k(\pi) - k(\pi_*)}(x)\, dx ,$$

where $f_l$ is the probability density function of the $\chi^2$ distribution of freedom $l$.

Proof: Notice that

$$2\{L(z^n, \pi) - L(z^n, \pi_*)\}
= 2n \log \frac{S(\pi)}{S(\pi_*)} + \{k(\pi) - k(\pi_*)\} d_n
= 2n \log\Big(1 - \frac{S(\pi_*) - S(\pi)}{S(\pi_*)}\Big) + \{k(\pi) - k(\pi_*)\} d_n , \qquad (9)$$

so that

$$L(z^n, \pi) < L(z^n, \pi_*) \iff \frac{S(\pi_*) - S(\pi)}{S(\pi_*) / n} > n\Big\{1 - \exp\Big[-\frac{k(\pi) - k(\pi_*)}{2n} d_n\Big]\Big\} . \qquad (10)$$

From Proposition 1, we obtain Theorem 1.

(Q. E. D.)
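
To see what Theorem 1 implies for specific criteria, the tail probability can be evaluated in closed form when $k(\pi) - k(\pi_*)$ is one or two. The sketch below does this for $d_n = 2$ (AIC), $d_n = \log n$ (MDL/BIC) and $d_n = 2c \log\log n$; the function names, the choice c = 1.01 and the value of n are assumptions for illustration.

```python
import math

def threshold(n, extra, d_n):
    """n * (1 - exp(-extra * d_n / (2n))): lower integration limit in Theorem 1."""
    return -n * math.expm1(-extra * d_n / (2.0 * n))

def chi2_tail(x, dof):
    """P(chi^2_dof > x), closed form for dof = 1 or 2 only."""
    if dof == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if dof == 2:
        return math.exp(-x / 2.0)
    raise ValueError("closed form implemented only for dof in {1, 2}")

def overestimation_probability(n, extra, d_n):
    """Tail probability of Theorem 1 when extra = k(pi) - k(pi_*)."""
    return chi2_tail(threshold(n, extra, d_n), extra)

n = 10**6
aic = overestimation_probability(n, 1, 2.0)
bic = overestimation_probability(n, 1, math.log(n))
hq = overestimation_probability(n, 1, 2.0 * 1.01 * math.log(math.log(n)))
```

For AIC the threshold tends to the constant 1, so the probability of choosing one superfluous variable stays near $P(\chi^2_1 > 1)$, about 0.317, however large n is; for the other two sequences the threshold grows with n and the probability decreases, more slowly for $2c \log\log n$ than for $\log n$.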

Hereafter, we do not assume that $\epsilon \sim \mathcal{N}(0, \sigma^2)$ but that $\epsilon$ is an independently
identically distributed random variable with expectation zero and variance $\sigma^2$.

Theorem 2 For $\pi \not\supseteq \pi_*$, $L(z^n, \pi) > L(z^n, \pi_*)$ with probability one as $n \to \infty$.

Proof: Suppose $q < p$. Given $X_p$, we choose an orthogonal matrix $W := [w_1, \dots, w_n]$ of
$P_p - P_q$ so that $W_1 = \langle w_1, \dots, w_{p-q} \rangle$ and $W_0 = \langle w_{p-q+1}, \dots, w_n \rangle$ are the eigenspaces
of eigenvalue one and zero, respectively. Since $\{\hat\alpha_{k,p}\}_{k=1}^{p}$ are strongly consistent estimators
(Lai-Robbins-Wei, 1978), we have for $j = 1, \dots, p - q$ with probability one as $n \to \infty$

$$s_j := \sum_{i=1}^{n} w_{i,j} y_i
= \sum_{i=1}^{n} w_{i,j} \Big[\sum_{k=1}^{p} x_{i,k} \hat\alpha_{k,p} + y_i - \sum_{k=1}^{p} x_{i,k} \hat\alpha_{k,p}\Big]
\to \sum_{i=1}^{n} w_{i,j} \Big(\sum_{k=1}^{p} x_{i,k} \alpha_k + \epsilon_i\Big)
= \sum_{i=1}^{n} w_{i,j} \Big(\sum_{k=q+1}^{p} x_{i,k} \alpha_k + \epsilon_i\Big) ,$$

where $w_j := [w_{1,j}, \dots, w_{n,j}]^T$. Since $\epsilon$ and $\sum_{k=q+1}^{p} \alpha_k X_k$ are independent, we have for
$j = 1, \dots, p - q$ with probability one as $n \to \infty$

$$\frac{1}{n} \sum_{j=1}^{p-q} s_j^2 \to \sum_{j=1}^{p-q} \Big(\sum_{i=1}^{n} w_{i,j} \sum_{k=q+1}^{p} \frac{x_{i,k} \alpha_k}{\sqrt{n}}\Big)^2 ,$$

and

$$x^n := \Big[\sum_{k=q+1}^{p} \frac{x_{1,k} \alpha_k}{\sqrt{n}}, \dots, \sum_{k=q+1}^{p} \frac{x_{n,k} \alpha_k}{\sqrt{n}}\Big]^T$$

has a positive constant square norm $\|x^\infty\|^2$ as $n \to \infty$ unless $\sum_{k=q+1}^{p} \alpha_k X_k = 0$ with
probability one, which contradicts our assumption. Since $x^n$ is not orthogonal to the
space $\langle w_1, \dots, w_{p-q} \rangle$ and $\|x^\infty\|^2 > 0$, from (5),

$$\frac{1}{n}(S_q - S_p) \to \lim_{n \to \infty} \sum_{j=1}^{p-q} (w_j^T x^n)^2 > 0 , \qquad (11)$$

which implies the theorem when $\pi \subset \pi_*$. Suppose $\pi \not\subset \pi_*$. In the same way, we notice
that (11) is true even for $q = |\pi \cap \pi_*|$, so that

$$\lim_{n \to \infty} \frac{1}{n} \{S(\pi \cap \pi_*) - S(\pi_*)\} > 0 . \qquad (12)$$

Furthermore, if we replace $\pi_*$ by $\pi \cap \pi_*$, from a similar discussion as in Theorem 1, we
have

$$\lim_{n \to \infty} \frac{1}{n} \{S(\pi) - S(\pi \cap \pi_*)\} = 0 . \qquad (13)$$

The statements (12) and (13) imply the theorem.

(Q. E. D.)

4 Proof of the Hannan-Quinn Proposition

Proposition 2 If $q > p$, with probability one,

$$\frac{S_p - S_q}{S_p / n} \le (q - p) \log\log n \qquad (14)$$

Proof: The notation is similar to Proposition 1, and let $p + 1 \le j \le q$. For $Z_i := \epsilon_i / \sigma$
with $v_j = [v_{1,j}, \dots, v_{n,j}]^T$, we have $r_j = \sum_{i=1}^{n} v_{i,j} \epsilon_i$ with expectation zero and variance $\sigma^2$,
and $E[\sum_{i=1}^{n} Z_i] = 0$, $E[(\sum_{i=1}^{n} Z_i)^2] = n$. Since $\epsilon_i$ is independently identically distributed,
$E[Z_i] = 0$, $E[Z_i^2] = 1$. From the law of iterated logarithms (Stout 1974), we have

$$\frac{\sum_{i=1}^{n} Z_i}{\sqrt{n \log\log n}} \le 1 , \qquad \frac{\sqrt{n} \sum_{i=1}^{n} v_{i,j} Z_i}{\sqrt{n \log\log n}} \le 1 ,$$

namely,

$$\frac{r_j}{\sigma} \le \sqrt{\log\log n}$$

with probability one, which means

$$\sum_{j=p+1}^{q} \frac{r_j^2}{\sigma^2} \le (q - p) \log\log n$$

with probability one.

(Q. E. D.)

Theorem 3 For $d_n := 2c \log\log n$ ($c > 1$), $L(z^n, \pi) > L(z^n, \pi_*)$ with probability one.

Proof: From Theorem 2, the error for $\pi_* \not\subseteq \pi$ is almost surely zero as long as $d_n / n \to 0$
($n \to \infty$), so that we only need to consider the case $\pi_* \subset \pi$. However, $d_n = 2c \log\log n$
with $c > 1$ implies that both sides of

$$\frac{1}{2}\{k(\pi) - k(\pi_*)\} d_n - \frac{1}{8n}[\{k(\pi) - k(\pi_*)\} d_n]^2
\le n\Big[1 - \exp\Big\{-\frac{k(\pi) - k(\pi_*)}{2n} d_n\Big\}\Big]
\le \frac{1}{2}\{k(\pi) - k(\pi_*)\} d_n$$

(see (10)) are at least $(q - p) \log\log n$ with $p = k(\pi_*)$ and $q = k(\pi)$ for large $n$ (Proposition
2), which implies Theorem 3.

(Q. E. D.)
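
The two-sided bound used in the proof of Theorem 3 can be obtained from the elementary inequality $x - x^2/2 \le 1 - e^{-x} \le x$ for $x \ge 0$. The following worked derivation is a sketch under that assumption; the abbreviation $x$ is introduced here only for illustration.

```latex
% Abbreviation (not in the original): x = \{k(\pi) - k(\pi_*)\} d_n / (2n) \ge 0.
x - \frac{x^2}{2} \;\le\; 1 - e^{-x} \;\le\; x
\quad\Longrightarrow\quad
\frac{1}{2}\{k(\pi) - k(\pi_*)\} d_n - \frac{1}{8n}\big[\{k(\pi) - k(\pi_*)\} d_n\big]^2
\;\le\; n\,(1 - e^{-x}) \;\le\;
\frac{1}{2}\{k(\pi) - k(\pi_*)\} d_n .

% With d_n = 2c \log\log n and q - p = k(\pi) - k(\pi_*):
\frac{1}{2}(q - p)\, d_n = c\,(q - p) \log\log n ,
\qquad
\frac{1}{8n}\big[(q - p)\, d_n\big]^2 = \frac{c^2 (q - p)^2 (\log\log n)^2}{2n} \to 0 .
```

Since $c > 1$ and the quadratic term vanishes as $n \to \infty$, both bounds exceed $(q - p) \log\log n$ for large $n$.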



5 Conclusion 



We proved that the Hannan-Quinn proposition is true for linear regression as well as for
auto regression (Hannan-Quinn, 1979) and for classification (Suzuki, 2006): the minimum
rate of $d_n$ satisfying strong consistency is $(2 + \epsilon) \log\log n$ for arbitrary $\epsilon > 0$.

Future problems include finding strong consistency conditions that hold for
all the cases, including linear regression, auto regression, and classification. Making clear
why the same $d_n = 2 \log\log n$ is the crucial rate for those problems would be the first step
to solve the problem.